Contributions

All members of the group participated equally throughout the preparation of this project assignment, which includes both the webpage and the explainer notebook. Felix's main responsibility was carrying out the temporal and network analysis. Christian's responsibility was carrying out the text analysis using TF-IDF and lexical dispersion plots, in addition to the network analysis. Daniel's responsibility was preprocessing the data, carrying out the sentiment analysis, and computing basic stats and properties of the graphs.

Motivation

What is your dataset?

The dataset for this project consists of a corpus containing the parliamentary debates in the British Parliament over the period 05-01-2015 to 01-03-2021. The dataset is collected from the publicly available debates and metadata provided by the British government. The original intent was to investigate what effect COVID had in a parliamentary context. Each speech has a marked-up transcript containing comments such as long gaps, applause, interruptions, etc.

Why did you choose this/these particular dataset(s)?

The political system is built upon a very complicated structure with many different actors who all have their own agendas. A place where key figures of this system clash, discuss their ideas, and try to advance their ideology and political beliefs is the Parliament. This project will therefore try to explore these complex networks of interactions, mainly through a toolbox consisting of text analysis of Parliament members' statements in the British Parliament and network analysis.

What was your goal for the end user's experience?

Our main goal was to demonstrate to the user that politics can also be investigated using mainly data-driven approaches, while keeping the reader intrigued through interactive plots that nudge the user to dive deeper and explore the different aspects of the analysis.

Basic stats. Let's understand the dataset better

Write about your choices in data cleaning and preprocessing

The first step is to collect and represent the data in a meaningful data structure that allows for further analysis. For this project, all data is collected in a single Pandas dataframe.

The raw data itself is spread over six subfolders, one for each year. In these folders there are two files for each day: one containing the transcripts themselves and another containing the corresponding metadata. Each of these files has an ID column that allows matching the rows of the two data files.

The two dataframes are then joined on this shared ID column.
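One way to combine the two files can be sketched as follows. The miniature dataframes here are hypothetical stand-ins for one day's transcript and metadata files; the real files are read from the per-year subfolders.

```python
import pandas as pd

# Hypothetical miniature stand-ins for one day's transcript and metadata files.
transcripts = pd.DataFrame({
    "ID": ["s1", "s2", "s3"],
    "Text": ["First statement.", "Second statement.", "Third statement."],
})
metadata = pd.DataFrame({
    "ID": ["s1", "s2", "s3"],
    "Speaker": ["MP A", "MP B", "MP A"],
    "Party": ["Labour", "Conservative", "Labour"],
})

# Match the rows of the two files on the shared ID column.
df = transcripts.merge(metadata, on="ID", how="inner")
print(df.shape)  # (3, 4)
```

Repeating this per day and concatenating the results yields a single dataframe covering the whole period.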

Write a short section that discusses the dataset stats (here you can recycle the work you did for Project Assignment A)

In the following section, a summary of the key values and characteristics of the dataset is presented.

The first step is to load the data that was collected in the previous steps.

The next step is to explore the size of the dataset.

Here, each datapoint should be understood as an MP's statement, while the next row in the dataset is the following MP's/speaker's statement.

The next step is to explore whether there is any redundant or dirty data that needs to be filtered out before further analysis.

From the counts of each column it is seen that the columns Session, Meeting, Sitting, and Agenda only contain NaN values and can therefore be excluded.

Here it is seen that the Speaker Birth column only contains "-" and not the actual birth date of the speaker. Therefore this column is also excluded.

As seen from the above result, the latest statement attributed to a member of the coalition is at index 27461. This means that the last date an MP is attributed to the coalition is 26-03-2015, and for the next six years MPs are only noted as Opposition. Therefore this column is also removed.

The next step is to find out how many unique speakers there are and how many speeches each of them makes.
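With the statements in a dataframe, this reduces to a `value_counts` over the speaker column. The toy dataframe below is a hypothetical stand-in for the full dataset, which has one row per statement.

```python
import pandas as pd

# Toy stand-in for the full dataframe (one row per statement).
df = pd.DataFrame({"Speaker": ["MP A", "MP B", "MP A", "MP C", "MP A", "MP B"]})

n_unique = df["Speaker"].nunique()       # number of unique speakers
counts = df["Speaker"].value_counts()    # statements per speaker, descending

print(n_unique)         # 3
print(counts.head(10))  # the top ten speakers by number of statements
```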

The following plot shows the top ten speakers.

From this plot it is apparent that the number of addresses each MP makes follows a power law. Prominent politicians are required to be more active in the houses, as seen from the top ten speakers, which consist of two Speakers of the House, two prime ministers, and otherwise ministers.

Next up is an exploration of the activity in the two houses.

There is a clear distinction in the activity of the two houses. One hypothesis is that the House of Commons is elected each term and its primary purpose is to debate political topics and propose new laws, whereas the Lords are appointed and their primary purpose is shaping laws and challenging the work of the government.

Based exclusively on the number of addresses, it seems that the activities in the houses are still highly male dominated.

The ten characteristics of big data

After this initial analysis of the data and its structure, some of the ten characteristics of big data become apparent. In this section, some of the most relevant characteristics are explored and used to highlight important distinctive features of the dataset.

Sentiment analysis

For the basis of the sentiment analysis of the speeches in Parliament, a dictionary-based sentiment analysis was applied. Here, a predefined dictionary with happiness scores associated with the 10222 most commonly used English words is used to explore whether sentiment analysis can highlight any underlying structures of the British Parliament.

First, the Hedonometer file containing the happiness scores is loaded.

To see the distribution of happiness scores a histogram is used.

The skewness measure indicates that the distribution is moderately skewed. More specifically, it shows that the distribution has a right-sided tail.
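The skewness check can be sketched as below. The scores here are hypothetical stand-ins for the Hedonometer lexicon, which rates roughly 10,000 words on a 1-9 happiness scale.

```python
import pandas as pd

# Hypothetical happiness scores standing in for the Hedonometer lexicon.
scores = pd.Series([5.2, 5.4, 6.1, 7.8, 4.9, 5.0, 5.3, 8.5, 5.1, 5.6])

# A positive sample skewness indicates a right-sided tail, as observed above.
print(scores.skew() > 0)  # True

# A histogram of the distribution (as used above) is then simply:
# scores.plot.hist(bins=50)
```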

The next step is the creation of two functions:

The first function find_the_happiness is based upon the formula: $$h_{avg}(T)=\sum_{i=1}^{N}h_{avg}(w_i)p_i$$ where $p_i = \frac{f_i}{\sum_{j=1}^{N}f_j}$ and $f_i$ is the frequency of the i'th word $w_i$.

The second function preprocess takes a text string as input and tokenizes it, filtering out stop words and punctuation.
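The two functions can be sketched as follows. This is a minimal version: the stop-word set is a tiny illustrative stand-in (the notebook uses a full list, e.g. NLTK's), and the happiness dictionary here is hypothetical.

```python
from collections import Counter
import string

def preprocess(text, stop_words=frozenset({"the", "is", "a", "of"})):
    """Tokenize a string, dropping stop words and punctuation.
    The stop-word set is a tiny stand-in for a full list."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    return [t for t in tokens if t and t not in stop_words]

def find_the_happiness(tokens, happiness):
    """Average happiness h_avg(T) = sum_i h_avg(w_i) * p_i, where
    p_i = f_i / sum_j f_j is the relative frequency of word w_i.
    Words missing from the dictionary are ignored."""
    freqs = Counter(t for t in tokens if t in happiness)
    total = sum(freqs.values())
    if total == 0:
        return None
    return sum(happiness[w] * f / total for w, f in freqs.items())

# Tiny hypothetical dictionary for illustration.
happiness = {"happy": 8.0, "sad": 2.0, "debate": 5.0}
tokens = preprocess("The debate is happy, happy!")
print(find_the_happiness(tokens, happiness))  # (5.0 + 8.0 * 2) / 3 = 7.0
```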

In the following steps the data is tokenized.

Since the data column describing whether each speaker is a member of the coalition or the opposition has been filtered out, the initial sentiment analysis is done on the two major parties from the coalition and the opposition: the Labour Party and the Conservative Party.

Save the previous computations so they don't have to be repeated.

Load in the data if stored.

Since dictionary-based methods need a large document size to work properly, it is seen from the above plot that especially the Labour document is on the limit of whether the sentiment analysis yields a reliable result. For the Conservative document, only a few days are below the recommended token size of 10000 words. This also reflects that, over the entire period of the dataset, the Conservatives are the larger party.

Here some days of interest are highlighted.

As seen from the plots and the mean happiness score for each party, there are no great fluctuations, either over time or between the two parties. This is mainly attributed to the speeches being given in a highly formal setting, and the words chosen reflect that. Since most MPs make use of this language, even when they strongly disagree on topics, it pushes the sentiment score towards the neutral 5. It is interesting, however, that the sentiment score is still above 5 in this setting, again showing support for the Pollyanna hypothesis.

A correlation of 0.595 between the happiness scores of the two parties indicates that the sentiment in Parliament follows, to some extent, the same tendencies. This makes sense, since there is a natural linkage between the topics addressed by MPs in the chronological order the data is presented. Many of the bills discussed in the house pose a problem where the parties differ on which solution is most appropriate.

Wordshifts

In the following section, the wordshifts for both the Conservative and Labour party will be explored.

Originally, this part of the analysis was intended for the Party status attribute (Coalition, Opposition), as it was hypothesised to yield a more distinctive separation of central word use related to the sentiment around certain events. Instead, the two major parties were chosen.

As the major event, the first COVID lockdown in the UK was chosen. To create the wordshifts, two lists are defined: one containing all the words said that day as a single document, and a reference list that looks back in time.
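The core of a wordshift, a per-word contribution to the change in average happiness between the target day and the reference period, can be sketched as below. This is a basic version of the decomposition; dedicated libraries such as shifterator refine it, and the word lists and scores here are hypothetical.

```python
from collections import Counter

def shift_contributions(target_tokens, ref_tokens, happiness):
    """Per-word contribution to the change in average happiness between a
    target document and a reference document (a basic word-shift measure)."""
    def rel_freq(tokens):
        freqs = Counter(t for t in tokens if t in happiness)
        total = sum(freqs.values())
        return {w: f / total for w, f in freqs.items()}

    p_t, p_r = rel_freq(target_tokens), rel_freq(ref_tokens)
    h_ref = sum(happiness[w] * p for w, p in p_r.items())
    words = set(p_t) | set(p_r)
    # A word contributes (h(w) - h_ref) * (p_target - p_ref); the
    # contributions sum to the total change in average happiness.
    return {w: (happiness[w] - h_ref) * (p_t.get(w, 0) - p_r.get(w, 0))
            for w in words}

# Hypothetical scores and documents for illustration.
happiness = {"crisis": 2.0, "hope": 8.0, "bill": 3.5, "people": 6.0}
ref = ["people", "bill", "hope", "people"]        # the 14-day reference
day = ["crisis", "crisis", "bill", "people"]      # the lockdown day
contrib = shift_contributions(day, ref, happiness)

# "crisis" appears more often and is a low-happiness word.
print(contrib["crisis"] < 0)  # True
```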

Now the same procedure is used for the Conservative Party.

Here it is seen that both parties are slightly down in happiness score compared with the 14 days before, which they are referenced against. But what is interesting is perhaps not the score itself, but the change in word frequency that the wordshifts uncover.

For negative words that emerge more than usual, "bill" is at the top for both parties. This is a natural finding, as legislators are proposing and discussing bills that can relieve the situation. Another series of negative words that are up are words describing the consequences of the pandemic, like "death", "emergency", "distancing", "crisis", etc.

The positive words (right side of the figure) tell how the politicians for the respective parties address this time of crisis and how they wish to inspire hope and unity in the population.

TF-IDF

The following section will make use of the text mining tool TF-IDF to try to distinguish important topics within different parties, e.g. the Liberals and the Conservatives, among other cool stuff that we will figure out on the fly.

First of all, a new preprocessing of the data is made that, in addition to removing stopwords, also stems the tokens.
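A minimal sketch of such a preprocessing step, assuming NLTK's Porter stemmer (which matches truncated forms like legisl(ation) and pandem(ic) seen later); the stop-word set here is a tiny stand-in for a full list:

```python
import string
from nltk.stem import PorterStemmer  # needs no corpus download

STOP_WORDS = {"the", "is", "are", "of", "and"}  # tiny stand-in for a full list
stemmer = PorterStemmer()

def preprocess_stemmed(text):
    """Tokenize, drop stop words and punctuation, then stem each token."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    return [stemmer.stem(t) for t in tokens if t and t not in STOP_WORDS]

print(preprocess_stemmed("The members are legislating on pandemics."))
```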

Data Fetch
Data manipulations

Each observation in the dataframe consists of a specific speech associated with metadata, e.g. speaker name, speaker party, etc. Initially, we will create one document for each party by concatenating all of the speeches conducted by the members of the same party. This yields a corpus consisting of:

We subtract one from the count, given that one "party" is simply NaN, thereby resulting in 47 documents. Furthermore, it should also be noticed that some of the 47 documents do not exclusively consist of transcripts from one single party, but are transcripts assigned to multiple parties due to interruptions and debates involving members of different parties.

We now have a document for each party, in addition to some documents that span multiple parties, resulting in a corpus of 47 documents. The next step is to compute the TF-IDF weightings of all terms in the corpus.

Due to the amount of data, this implementation is very time consuming. Therefore, the rest of the analysis will make use of Sklearn's vectorized TF-IDF implementation. However, we will first demonstrate that the two implementations yield the same results.

Our implementation

Sklearn's implementation

By comparing the results from the two implementations, it can be seen that the estimated TF-IDF weights are equivalent. Also note that the technical details of how TF-IDF was implemented can be found in the docstring of our own implementation. The most important detail is that we deviate from the traditional textbook formulation by introducing smoothing and adding one to the IDF estimate.

In the following section, sklearn's implementation of TF-IDF will be used to determine the TF-IDF representation of the 47 documents. Subsequently, the representations will be visualized using WordClouds. The initial WordClouds demonstrated that the IDF weighting alone was not quite enough to alleviate the disproportional importance of frequent words. Therefore, upper and lower thresholds were introduced to remove words that occur in more than 70 % of the documents or in less than 10 % of the documents.

Party similarity (TF-IDF & Cosine similarity)

Once the term-document matrix with TF-IDF weighting has been determined, the cosine similarity between each pair of documents can be used to determine similarity between documents, and thereby between parties.

Given that sklearn's TF-IDF vectorizer already normalizes the rows using the l2 norm, the cosine similarity can simply be determined as follows:
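A sketch of this shortcut with hypothetical stand-in documents: because each row is a unit vector, the matrix product of the TF-IDF matrix with its transpose is exactly the cosine-similarity matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # hypothetical stand-ins for the per-party documents
    "health care hospital doctor",
    "health care nurse hospital",
    "trade economy tax market",
]
X = TfidfVectorizer().fit_transform(docs)  # rows are already l2-normalized

# Dot products of unit vectors equal cosine similarities.
sim = (X @ X.T).toarray()

print(np.allclose(np.diag(sim), 1.0))  # True: each document matches itself
print(sim[0, 1] > sim[0, 2])           # True: the two health docs are closer
```

The resulting `sim` matrix is what is depicted in the heatmap below.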

Let's depict the similarity matrix using a heatmap:

Temporal Text Analysis

Bubble plots - lexical dispersion plots

The following section will display lexical dispersion plots of the topics in the British Parliament with a unique style. Notice that we had to alter the typical lexical dispersion plot slightly, primarily because most topics are always mentioned to some degree, making it difficult to see the variation in the nltk implementation. The alteration consists of making the size of the marker proportional to the frequency.

In the following implementation, a document is created per day by joining the speeches for that specific day into a single string. Instead of using a regular expression pattern to find occurrences of the chosen term, we decided to tokenise the joined string and remove all tokens that are not equal to the term. The described procedure is repeated for each of the chosen terms. Consequently, a frequency per day is computed for each term.

Trend Plots

Below this cell, we will investigate some different temporal relations using plotly and raceplotly. We decided to aggregate dates into months because there was so much daily variation in topic mentions. For comments on these plots, we refer you to our website.

Network Analysis

Network Analysis of the British Parliament Speeches

Disclaimer: We have not run this part of the notebook a final time (because it would take too long). Instead, we have posted the images from our plots saved previously. - Sorry

The following section will attempt to analyse the political activity within the British Parliament by leveraging the results of the topic and text analysis of the speeches from the previous section combined with methods from the network science field.

The first step in the analysis is to construct a network using the speech transcripts and associated metadata, which can be found in the data section. One possible approach for constructing the network is to create a bipartite network with one partition being the members of the Parliament and the other partition being the political topics discovered in the text analysis section. Hence, we construct a network that tries to model the relationship between politicians and political topics based on the speeches conducted in the Parliament. Furthermore, to model the relationships within each partition, it is possible to project the initial bipartite graph into two distinct networks, as depicted in the following visualization:

This creates a network consisting of the members of the Parliament and a network consisting of the political topics. In the projected network with Parliament members, two members share an edge if they had a common neighbour in the bipartite network. Furthermore, the weight of each edge corresponds to the number of mutual neighbours of the two nodes in the bipartite network.

Load Data

Modelling

The first step is to create the bipartite network. The relations between members of the Parliament and the chosen topics were determined by assessing which topics a given member mentions sufficiently frequently in their speeches. This was accomplished by simply computing the frequency distribution of topics for each Parliament member and using one standard deviation as a threshold for determining when a topic is mentioned sufficiently frequently.

Hence, for each member of the Parliament, an edge is drawn from the member to a topic if the frequency of the topic is more than one standard deviation above the average topic frequency for that particular Parliament member.

In the following section, the bipartite graph is constructed using the described approach. Subsequently, it is visualized using the netwulf library for graph visualizations.
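The construction rule can be sketched as below. The per-member topic counts are hypothetical stand-ins for the counts extracted from the real speeches.

```python
import networkx as nx
import numpy as np

# Hypothetical per-member topic counts; the real counts come from the speeches.
topic_counts = {
    "MP A": {"health": 30, "eu": 5, "tax": 4},
    "MP B": {"eu": 40, "health": 6, "tax": 8},
}

B = nx.Graph()
for member, counts in topic_counts.items():
    freqs = np.array(list(counts.values()))
    threshold = freqs.mean() + freqs.std()
    for topic, f in counts.items():
        # Link the member to topics mentioned more than one standard
        # deviation above their average topic frequency.
        if f > threshold:
            B.add_edge(member, topic, weight=f)

print(sorted(map(sorted, B.edges())))
```

Note that a member whose counts are roughly uniform will clear the threshold for no topic and end up as a singleton, which is exactly the effect discussed below.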

This results in a bipartite graph with 1913 nodes and 3709 edges.

We created a colormap to use on most of our plots for consistency. The colormap is also used in the text-analysis part but because we created everything in parallel, this cell is where we defined it.

It can be seen that there are a few singletons present in the network. This is because of how we assign edges: if a parliament member (pm) talks about all topics equally often, there will be no topic that stands out from the others, and our code will not assign an edge to any topic. It is, however, possible for a pm to have more than one edge assigned to them. These nodes can be seen as "bridges" between topics in the network. These bridges connect topics, which effectively groups similar topics together in the network. We can for example see that eu and legisl(ation) are connected by multiple bridges and are therefore placed near each other in the network.

We can also see that, by this measure, topics related to health include test, pandem(ic), vaccin(e), and educ(ation).

Projected Networks

Subsequently, the projected network consisting of the members of the Parliament was determined in order to model the interaction between Parliament members. We remove all singletons, because they would only form their own components, disconnecting the graph. The weighted projected graph is determined using networkx's implementation.
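A sketch of the projection step, using a tiny hypothetical bipartite graph: `weighted_projected_graph` links two members whenever they share a topic and weights the edge by the number of mutual neighbours, matching the description above.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Tiny hypothetical bipartite stand-in: members on one side, topics on the other.
B = nx.Graph()
B.add_edges_from([("MP A", "health"), ("MP B", "health"),
                  ("MP B", "eu"), ("MP C", "eu"), ("MP C", "health")])
members = {"MP A", "MP B", "MP C"}

# Weighted projection: edge weight = number of mutual neighbours in B.
P = bipartite.weighted_projected_graph(B, members)
P.remove_nodes_from(list(nx.isolates(P)))  # drop singletons, as described above

print(P["MP B"]["MP C"]["weight"])  # 2: they share both "health" and "eu"
```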

Image:

The annotations of each node are determined by using the edge weights in the original bipartite network and then annotating each Parliament member with the topic associated with the highest edge weight.

We see the same trends as in the bipartite network: groups/communities containing similar topics tend to form. However, it is more visible from this plot which communities are formed. defenc(e) is closely related to world, tax to pension and legisl(ation), and eu to legisl(ation) and economi(cs). The plot also tells us something about pms and their neighbours: the closer the distance between two pms, the more topics of interest they share. This means that the labeled version of the plot tells us which politicians are similar to each other with regard to topic mentions (not opinions).

Graph Properties

Naturally, all pms address all the topics to some degree. Therefore, the thresholding used when defining the network has a high impact on the structure of the network. However, as seen from the degree histogram for the projected network, each node still has a very high average degree, meaning that the network is highly interconnected.

Image:

Returns 0.80

The clustering coefficient for the projected graph is 0.80. The clustering coefficient can be seen as a measurement of local link density in the network. A clustering coefficient of 0.80 implies that two neighbours of a given node have an 80 % chance of being connected themselves. This is seen in the network visualisation, which consists of many local, tightly connected clusters with bridges of pms between them that connect them into a global structure.
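The measurement can be sketched on a toy graph: one triangle plus a pendant node, where networkx averages, over nodes, the fraction of neighbour pairs that are themselves connected.

```python
import networkx as nx

# A small graph with one triangle (1-2-3) plus a pendant node 4.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

# Nodes 1 and 2 have clustering 1, node 3 has 1/3, node 4 has 0.
print(nx.average_clustering(G))  # (1 + 1 + 1/3 + 0) / 4 ≈ 0.583
```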

Community Detection

The following section will conduct community detection using the Louvain algorithm on the projected MP graph.

The Louvain algorithm detects four communities. The following bar plot visualizes the size of each community. We decided to use a barplot instead of a histogram due to the small number of communities; if many communities had been discovered, a histogram would have been more appropriate and in alignment with our usual procedure in the course.
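The detection step itself can be sketched as below. The notebook uses the python-louvain package; this sketch uses networkx's own Louvain implementation instead, on a hypothetical toy graph of two dense groups joined by a bridge, with `seed` pinning the stochastic start.

```python
import networkx as nx

# Two clear groups joined by a single bridge, standing in for the MP graph.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # triangle 1
                  (3, 4), (4, 5), (3, 5),   # triangle 2
                  (2, 3)])                  # bridge

# networkx's Louvain implementation; python-louvain's best_partition is
# equivalent in spirit (both maximize modularity).
communities = nx.community.louvain_communities(G, seed=42)
sizes = sorted(len(c) for c in communities)
print(sizes)  # [3, 3]: the two triangles are recovered
```

The community sizes can then be shown directly as a barplot, as done above.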

Image:

The size of each cluster can be seen in the barplot above. The plot shows that the Louvain algorithm has detected four relatively large communities. This entails that, according to modularity, the network structure is best explained by four large communities instead of many small communities.

Next the graph will be visualized:

Image:

It is again apparent that the communities are quite well interconnected, meaning that each community is strongly linked to the other communities. This can be seen in the middle of the network, where we have a mix of different communities. Interestingly, when comparing this plot with the previous network plot, it can be seen that some of the groups we discussed earlier have been clustered together by the Louvain algorithm. For instance, community 0 consists of, among others, health, pandem(ic), vaccin(e), and test.

Partition Comparison

The following section will compare the communities detected by the Louvain algorithm with the communities created by using the edge weight between members of the Parliament and the political topics. To conduct the partition comparison, we will use the normalized mutual information between the two partitions. Normalized mutual information can be formalized as the following:

$$I_{n}(X;Y)=\frac{I(X;Y)}{\frac{1}{2}H(X)+\frac{1}{2}H(Y)}$$

where X and Y are the partitions being compared and H(.) is the Shannon entropy, e.g.: $$H(Y)=-\sum_{y}p(y)\log(p(y))$$

Mutual information gives us an estimate of how much information one would gain about X by knowing the variable Y. However, to be able to compare the mutual information score across partitions which may have different sizes, the estimate has to be normalized. Furthermore, the Shannon entropy H(X) is an estimate of the information associated with the variable X.

Given that the Louvain algorithm is stochastic to a certain degree, the detected communities may differ based on initialization. Hence, to calculate the normalized mutual information (NMI) between the Louvain partition and the topic partition with an uncertainty measure, the Louvain algorithm was executed a thousand times while determining the NMI in each repetition. The average NMI between the two partitions with a 95% confidence interval was: $$mean(I(Louvain\ partition,\ Topics\ partition)) = 0.30 \pm 0.001$$

Here the mean value of the normalized mutual information is found to be 0.301 with a confidence interval of (0.2994, 0.3026).

To assess whether or not the estimated normalized mutual information is significant, a randomization test will be conducted. The randomization is conducted by randomly shuffling the topic partition, and the normalized mutual information between the Louvain communities and the random partition is then computed. To achieve a measure of uncertainty, the described procedure is repeated a thousand times.
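The NMI estimate and the randomization test can be sketched as follows, assuming sklearn's `normalized_mutual_info_score`; the two labelings below are hypothetical stand-ins for the Louvain and topic partitions.

```python
import random
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical labelings standing in for the Louvain and topic partitions.
louvain = [0, 0, 0, 1, 1, 1, 2, 2, 0]
topics  = [0, 0, 0, 1, 1, 1, 2, 2, 2]

nmi = normalized_mutual_info_score(louvain, topics)

# Randomization test: shuffle one partition many times and collect the NMI
# under the null hypothesis of no association.
rng = random.Random(0)
null_nmis = []
for _ in range(1000):
    shuffled = topics[:]
    rng.shuffle(shuffled)
    null_nmis.append(normalized_mutual_info_score(louvain, shuffled))

# The observed NMI should exceed the null mean if the association is real.
print(nmi > sum(null_nmis) / len(null_nmis))  # True
```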

The partition comparison yields that there is some shared information between the topic partition and the communities detected by the Louvain algorithm ($NMI \approx 0.30$). However, not all the information of one partition can be described using the other. Consequently, an investigation of the topics within the detected Louvain communities will be conducted. A first approach is simply to compute the topic distribution within each community:

Images:

As mentioned before, community 0 consists of health-related topics such as vaccin(e), test and pandem(ic).

Subsequently, the frequency distribution of community 1 is mainly dominated by the topic legislation, followed by EU, which probably corresponds to the large mixed cluster in the previous network plot.

The topic frequencies within community 2 are more varied, with the most frequent topics being world, education, tax, economy, defence, and employment.

Lastly, the topic distribution of community 3 demonstrates that the members of the community are mainly associated with the topic EU.

Subsequently, TF-IDF, as previously used in the text analysis section, can be used to generate wordclouds for each detected community. One feasible approach is to group the speeches of the Parliament members within the same community into a single document, resulting in four documents, one for each community. Furthermore, the IDF is computed on the entire corpus to achieve reasonable estimates. The TF-IDF representations of the four documents are then computed using this corpus-wide IDF estimate.

Community 0 Community 1

Here we see words like health, care, servic(e), and state in the WordCloud for community 0, which fit the overall health care theme. Community 1 consists of words like legisl(ation), eu, and law, which also fit the legislation/eu theme.

Community 2 Community 3

Community 2 is the mixed community, and that shows in its WordCloud. Words like import, countri(es), and local seem to fit topics like economi(cs), world, and educ(ation). Finally, community 3 mostly contains the topic eu; its WordCloud contains words like eu, trade, vote, and import.

The following section will compare the modularity of the two partitions. We will initially use our own implementation. However, given that our own implementation does not take edge weights into account, we expect the estimated modularity to be misleading.

To take edge weight into account, we will use the modularity function provided in the Louvain API library, which corresponds to the modularity function used for detecting the communities.
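For illustration, networkx's modularity function also accepts a `weight` key and behaves analogously; the toy weighted graph and the two candidate partitions below are hypothetical.

```python
import networkx as nx

# Hypothetical weighted toy graph: two triangles joined by a light bridge.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 3.0), (1, 2, 3.0), (0, 2, 3.0),
                           (3, 4, 3.0), (4, 5, 3.0), (3, 5, 3.0),
                           (2, 3, 1.0)])

partition_a = [{0, 1, 2}, {3, 4, 5}]  # matches the dense groups
partition_b = [{0, 1, 3}, {2, 4, 5}]  # cuts across them

# Edge-weighted modularity for each candidate partition.
q_a = nx.community.modularity(G, partition_a, weight="weight")
q_b = nx.community.modularity(G, partition_b, weight="weight")
print(q_a > q_b)  # True: the aligned partition has higher modularity
```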

The estimated modularities demonstrate that taking edge weights into account did not have a significant influence on the estimates. Therefore, the conclusion is still that the Louvain communities are associated with a higher modularity, which entails that the community structure detected by the Louvain algorithm is closer to the true nature of the network structure than the topic partition.

Discussion

The discussion commenting on the overall findings can be found on the webpage. The discussion here in the explainer notebook will mainly focus on the technical aspects of the analysis.

What went well?

The implementation of the text analysis did more or less go as planned, demonstrating that TF-IDF can be used to find descriptive words for each party. The temporal analysis demonstrated the evolution of topic frequencies using lexical dispersion plots and scatterplots of monthly frequencies with rolling averages. In terms of network modelling, it was feasible to model the dataset using bipartite and projection networks. Furthermore, community detection and partition comparison were conducted using the constructed networks, and TF-IDF was used to investigate what the communities address.

What is still missing? What could be improved?, Why?

An aspect that could be improved upon is the sentiment analysis. Here, methods other than the proposed dictionary-based method could be explored, especially sentiment analysis methods that do not require large document sizes, so that local fluctuations in sentiment can be detected. Furthermore, it would be beneficial to conduct a further investigation of how to most suitably model the political activity as a graph. For instance, one aspect that could be improved is how to determine the threshold for when a member mentions a topic frequently enough for a link to be created between the member and the topic. It could also be interesting to try a completely different approach to the network modelling, perhaps using directed instead of bipartite graphs. Our initial thought was to create edges between members of the Parliament who mention each other, as an attempt to model the relationships between Parliament members. However, we discovered that the members rarely address each other by name, and a different approach would therefore have to be used. Hence, we decided to use the projection of the bipartite graph, but it would be interesting to investigate whether other approaches could be feasible.